MARTT: Using Induced Knowledge Base to Automatically Mark up Plant Taxonomic Descriptions with XML

Author

  • Hong Cui
Abstract

Despite the sub-language nature of taxonomic descriptions of plants, researchers have warned about the large variations among different collections of descriptions in terms of information content and presentation. These variations pose a serious challenge to the development of automatic tools for the semantic markup of large volumes of free-text descriptions. This paper presents a new approach to the automatic markup of multiple collections of taxonomic descriptions with XML. The effectiveness of the approach was demonstrated with markup experiments using three contemporary floras. The markup system, MARTT, was based on supervised machine learning algorithms and enhanced by machine-learned association rules representing certain types of domain knowledge and conventions. Experiments showed that our simple and efficient markup algorithm outperformed popular general-purpose algorithms (including SVMs) across different floras. More importantly, the results demonstrated that the domain knowledge learned from one flora was useful for improving markup performance on a second flora, especially on elements with sparse training examples. This paper reports the system design and the evaluation of the markup algorithms; the study of the effectiveness of the induced knowledge base will be reported in a later paper. Common practices of flora authors and the potential of the MARTT system for improving the efficiency and effectiveness of the creation, organization, and use of plant descriptions are also discussed.

Similar resources

MARTT: A General Approach to Automatic Markup of Taxonomic Descriptions with XML

Despite the sub-language nature of taxonomic descriptions of animals and plants, researchers have warned about the existence of large variations among different description collections in terms of information content and its representation. These variations impose a serious threat to the development of automatic tools to structure large volumes of text-based descriptions. This paper presents a ...

Step – By – Step

Approaches to formalisation of medical guidelines can be divided into model-centric and document-centric. While model-centric approaches dominate in the development of clinical decision support applications, document-centric, mark-up-based formalisation is suitable for application tasks requiring the 'literal' content of the document to be transferred into the formal model. Examples of such tas...

Who-Does-What: A Knowledge Base of People's Occupations and Job Activities

We present a novel resource called “Who-Does-What” (WDW), which provides a knowledge base of activities for classes of people engaged in a wide range of different occupations. WDW is semi-automatically created by automatically extracting structured job activity descriptions from the Web (we use here the O*Net website). These descriptions are used to populate the taxonomic backbone provided by t...

Digitising legacy zoological taxonomic literature: Processes, products and using the output.

By digitising legacy taxonomic literature using XML mark-up the contents become accessible to other taxonomic and nomenclatural information systems. Appropriate schemas need to be interoperable with other sectorial schemas, atomise to appropriate content elements and carry appropriate metadata to, for example, enable algorithmic assessment of availability of a name under the Code. Legacy (and n...

Automating XML mark-up using a two stage machine learning technique

We introduce a novel two-stage automatic XML mark-up system, which combines the WEBSOM approach to document categorisation in conjunction with the C5 inductive learning algorithm. The WEBSOM method clusters the XML marked-up documents such that semantically similar documents lie close together on a Self-Organising Map (SOM). The C5 algorithm automatically learns and applies mark-up rules derive...

Publication date: 2005